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High-Performance  Scalable  Base-4  Fast  Fourier  Transform  Mapping 

Dr.  J.  Greg  Nash,  Centar 
(jgregnash@centar.net,  www.centar.net) 


A  novel,  scalable  parallel  FFT  architecture  mapping  is  described  here  that  supports  transform  lengths  which  aren’t 
powers  of  two  or  four,  that  provides  low  latency  as  well  as  high  throughput,  that  can  do  both  1-D  and  2-D  discreet  Fourier 
transforms  (DFTs),  that  is  ideally  suited  to  today’s  complex  FPGA  architectures,  that  possesses  all  the  regularity  and  design 
simplicity  of  systolic  arrays  and  that  is  naturally  suited  to  a  parameterized  HDL  form.  Its  algorithmic  underpinnings  are  based 
on  an  observation  that  with  suitable  permutations,  the  DFT  coefficient  matrix  can  be  partitioned  into  regular  blocks  of  smaller 
”base-4”  matrices  (equivalent  to  a  decimation  in  time  and  frequency)  [1].  From  this  new  base-4  matrix  DFT  description  we 
have  derived  a  new  latency  and  throughput  optimal  base-4  FFT  architecture.  It  combines  the  performance  of  traditional  radix- 
4  ’’pipelined  FFTs”  with  the  design  and  implementation  simplicity  of  systolic  arrays,  and  yet  is  versatile. 

An  N  point  DFT  is  defined  by 

Z[k]=  z  X[n]  k  =  l  2...N  or  Z  =  CX  (1) 

/7—1 

where  X[n]  are  the  time  domain  input  values,  Z[k]  are  the  frequency  domain  outputs  and  C  is  a  coefficient  matrix  containing 

elements  W^1  =  e  2jK^n  1)(^  1)/A\  In  order  to  transform  C  into  the  desired  base-Z?  (b=4)  format  it  is  necessary  to  find  a 
permutation  matrix  P  that  reorders  X  and  Z  according  to 

Xb=P  [X!  V2  V3  V4  X5...XN_3  XN_2  XN_x  xj 

x2-x> 
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(2) 


and  Zb  =  P  Z  .  With  this  value  of  P,  C  can  be  transformed  into  Cb  =  PCP  ,  so  that  Zb  =  CbXb 
allows  Cb  to  be  written  as  an  ( N  /  b)  X  ( N  /  b)  array  of  bxb  blocks,  each  block  Cb  [/,  j]  specified  by 


This  transformation 


Cb[i,j]=WM[i,j]*c 


1)  mod(6))+lC(0— 1)  mod(6))+l 


where  cDi  =  c  ■  /  with  cz  a  Z?-element  vector,  C-  is  a  bxb  matrix,  each 


row  being  cz*  ,  and  WM  [/,  j]  is  an  element  in  the  (N  /  b)  X  (N  /  b )  matrix, 
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With  this  block  reformulation  it  is  possible  to  factor  (1)  into 


y=wm.cmix 
Z  =  CmiY' 


(4) 


2 

where  ”  •  ”  in  (4)  corresponds  to  an  element  by  element  multiply  [2].  In  (4)  CMl  and  CM2 contain  N lb  submatrices 
Cb  =  [ci  I  c2  I  •••  I  cb\  f°rm  CM i  =  | "_CB  |  CB  |  ...J  and  CM2  =  \CB  \  CB  | ...] ,  and  Z,  Xhave  been  redefined  j 


follows: 


Z  = 
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and  CB  describes  a  radix-4  decimation  in  time  butterfly. 

By  comparing  (4)  with  (1),  the  computational  advantages  of  the  manipulation  leading  to  the  FFT  algorithm  form  (4)  for 
base-4  designs  are  readily  evident.  In  (4)  the  matrix  products  CM1X  and  CM2Yl 2  involve  only  exchanges  of  real  and 
imaginary  parts  plus  additions  because  the  elements  of  CMl  and  CM2  contain  only  ±1  or  ±  j  ,  whereas  the  product  CX in  (1) 
requires  complex  multiplications.  Also,  the  size  of  the  coefficient  matrix  WM  in  (4)  is  ( N  /  b)  x  ( N  /  b)  vs.  the  NxN  size  of 
C  in  (1);  consequently  the  number  of  overall  direct  multiplications  in  (4)  is  reduced  by  a  factor  of  xl6  compared  to  the  direct 
form  (1)  on  which  past  systolic  FFT  implementations  are  based.  Note  that  distribution  of  the  elements  in  CMl  and  CM2  does 

not  impose  significant  bandwidth  requirements  because  full  complex  numbers  are  not  used. 

The  overall  processing  for  an  M  point  FFT  is  done  using  the  factorization  M=N1N2>  followed  by  a  series  of 
”row/column”  1-D  FFTs  on  Nj  and  N2.  Each  of  these  1-D  FFTs  is  performed  by  a  secondary  factorization  into  2-D  FFTs 
according  to  (4)  and  again  using  a  "row/column”  approach.  The  first  equation  in  (4)  is  equivalent  to  performing  a  ’’column” 
FFT  and  the  second  equation  is  equivalent  to  performing  a  ’’row”  FFT.  In  between  there  is  a  ’’twiddle  factor”  multiplication 
by  the  W  in  WM  (3).  Because  both  the  row  and  column  FFTs  in  the  secondary  factorization  (4)  are  broken  down  entirely  into 
sets  of  4-point  DFTs,  they  can  be  done  without  complex  multiplications.  Also,  the  usual  matrix  transpose  in  between  column 
and  row  DFTs  is  not  necessary. 

A  systolic  array  architecture  mapping  was  performed  using  the  mathematical  formulation  (4)  as  input  to  the  mapping 
tool  SPADE  [2].  Behavioral  simulations  of  this  architecture  using  a  register  transfer  level  simulator  verify  its  operation. 
Performance  estimates  of  the  FFT  computation  times  are  shown  in  Table  1  for  a  variety  of  transform  sizes. 


Size  (points) 

T  (cycles/DFT) 

T  (psec/DFT) 

Multipliers 

Adders 

256 

210 

1.0 

4 

32 

512 

274 

1.3 

8 

64 

1024 

658 

3.1 

8 

64 

2048 

914 

4.3 

16 

128 

4096 

2322 

10.8 

16 

128 

8192 

3346 

15.6 

32 

256 

Table  1.  Performance  estimates  and  arithmetic  requirements  for  various  transform  sizes  (16-bits  fixed  point)  based  on  a  partially 
populated  Altera  Stratix  EP1S60  "medium  speed  grade”  FPGA  chip.  In  this  table  "T"  is  the  throughput.  (Computational  latency  for  each 
transform  size  above  is  approximately  equal  to  the  inverse  of  the  throughput  time/DFT). 

Although  Table  1  only  shows  transforms  that  are  powers  of  two,  the  base-4  FFT  lengths  are  not  limited  to  powers  of  two  or 
four.  For  example,  the  base-4  FFT  is  capable  of  29  transform  lengths  from  256  to  65,536  vs.  only  5  possible  lengths  for  a 
radix-4  pipelined  FFT. 

The  single  biggest  drawback  to  past  use  of  systolic  arrays  has  been  the  substantial  arithmetic  hardware  that  is  normally 
required  because  systolic  approaches  use  a  number  of  complex  multipliers  equal  to  the  size  of  the  transform.  Thus,  a  1024- 
point  DFT  would  require  1024  complex  multipliers,  compared  to  the  8  multipliers  shown  in  Table  1  for  the  base-4  FFT. 

Traditional  ’’pipelined”  FFTs,  although  computationally  efficient,  are  difficult  to  map  into  VLSI  because  in  general 
each  butterfly,  delay/commutator,  and  twiddle  factor  ROM  has  a  different  circuit  design  and/or  its  operation  varies  from  stage 
to  stage.  Also,  the  butterflies  do  not  usually  work  with  100%  resource  efficiency,  the  designs  are  limited  to  transform  lengths 
that  are  powers  of  two  or  four,  they  are  architecturally  suited  only  for  a  1-D  DFT  or  2-D  DFT  but  not  both,  and  it  is  difficult 
to  build  scalable  designs  because  of  their  irregularity  and  large  granularity.  Finally,  the  latency  (time  to  do  the  first  DFT  in  a 
series)  is  low  because  the  pipeline  has  to  be  ’’filled”  first.  Alternatively,  the  base-4  FFT  architecture  is  comprised  of  simple, 
identical,  small  processing  elements  (PEs),  arranged  in  regular  arrays  with  each  PE  operating  at  near  100%  efficiency. 
Performance  figures  in  Table  1  compare  very  well  to  larger  custom  ASIC  designs  and  recent  FPGA  implementations. 
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Base-4  Matrix  Equation 
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DFT  matrix  equation  becomes 
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Base-4  DFT  Matrix  Equation 
(Compact  Form) 
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•  General  Form 
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Base-4  DFT  Equation  Characteristics 


Coefficient  matrices  represent  series  of  4-point  transforms: 
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=>  Takes  advantage  of  reduced  arithmetic  with  radix  r  =  4  butterfly,  but 
transform  length  not  limited  to  N  -  r  m 

=>  Transform  length  must  be  divisible  by  16 

•  CM 1  and  Cm  2  contain  only  elements  from  the  set  {1,-1,-/,/} 

=^>  CM 1 X  and  CM 2^  only  involve  complex  additions 

•  Twiddle  factor  matrix  WM  is  of  size  A//4  x  A//4  rather  than  Nx  N 

=>  x16  fewer  multiplies  than  original  DFT  equation  ( Z=CX) 


Systolic  Array  Example:  Matrix  Multiply 


•  Algorithm: 

c[i, 7']=X d\i’ /c] * 4k,  j]  for  \<i,j,k<N 

k-\ 


•  Space-time  mapping:  computations  at  {i,j,k}  “mapped”  to  indices 
{time,x,y} 


Systolic  Array:  Each  intersection  point  corresponds 
to  a  “processing  element”  (PE)  that  receives  data 
from  its  neighbors,  does  a  multiply-add,  and  passes 
the  result  to  adjacent  PEs,  once  per  time  cycle. 


“Space-Time”  View 
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Find  Systolic  Architecture  Using  SPADE 
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Mathematical 

Algorithm 


for  j  to  N/4  do 
for  k  to  N/4  do 

Y [ j , k] : =WM[ j , k] *add (CMl [ j , i] *X [i ,k] ,i=l. .4) ; 
od; 

for  k  to  4  do 

Z [k, j ]  :=  add (CM2 [k,i]*Y[j,i] , i=l . .N/4) ; 

od 
od; 


Variable  position, 
area,  regularity,  bandwidth 


Input 
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Search  for  Space-Time 
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Simulator, 

Graphical 

Outputs 


v  e  { X,  y  ,  Z,  CM  1 ,  CM  2,  WM } 


tSymbolic  Parallel  Algorithm  Development  Environment 


SPADE  Functionality 


•  SPADE  accepts  input  statements  of  the  affine  form 


x(AxI  +  ax)  depends  on  y(B yI  +  b  ) 


e.g.,x(2iJ  +  \)  =  x{ 


2 
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1 


) 


for  all  I  tV  (I ) 


-  Where  Ax,By/ax,by  are  integer  matrices/vectors,  S  is  the  dimension  of 
the  algorithm  space  and  the  “depends  on”  includes  commutative  and 
associative  operators:  min ,  max,  X,  77 


•  SPADE  finds  latency  optimal  systolic  designs  subject  to  constraints 
imposed  by  scheduling,  localization,  reindexing,  and  allocation 


Secondary  objective  functions  used  to  select  architectures  are 
minimum  area,  maximum  regularity  and  minimum  network  bandwidth 


Systolic  Array  Designs:  Minimum  Area 


CM2 


N/4=16 


Space-Time  Views  (N=64) 


Multipliers  Adders  4 

Example  Systolic  Array  Views  (A/=64) 


•  Latency  (cycles)  =  NI2  +  8 

•  Six  unique  designs 

•  Throughput  (cycles/block)  =  NI4  +  6 

•  WM  mapped  to  same  space-time  location  as  Y 

•  IM1  and  IM2  variables  (SPADE  created)  perform  matrix 
multiply/adds 


Systolic  Array  Designs:  Maximum  Regularity 


Two  unique  designs  found 
Throughput  and  latency  optimal 
Latency  (cycles)  =  NI2  +  8 
Throughput  (cycles/block)  =  NI4  +1 

WM  mapped  to  same  space-time  position 
as  Y 


Systolic  Arrays  (N=32) 


Y  =W*  C  X 
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7  -C  Yr 
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Y 
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CM2 
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0  0 
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time 


Transformations 


CM2 


Space-time  view  (N=32) 


Systolic  Architecture  to  Array  Design 


Systolic  Architecture  (N=32) 


Array  Design  (N=32) 


□  Processing  Element  1:  2  registers,  1  adder 

□  Multiplier 

□  Processing  Element  2:  2  registers,  1  adder 
t  — ►  Data  flow  bus 


Altera  Stratix  FPGA:  DFT  Mapping 
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ID  FFT  via  Factorization 


•  Factor  N-N1*N2 

•  Creat  a  2-D  matrix  with  N1  rows  by  N2  columns,  ^assume  N1  >  N2), 

•  Do  N2  1-D  “column”  DFTs  followed  by  N1  “row”  DFTs: 

Y=Wm*X 
Y'  =  WU*Y 
Z=Y*Wn2 

•  If  N1 «  N2  then  (linear)  array  size  can  be  reduced  from  0(  N1  N2 )  to 
0(A/t)  with  minimal  effect  on  throughput: 

-  Cycles  for  A//4  array  (no  factorization)  =  A//4  +  1 

-  Cycles  for  N1  /4  array  =  N1  (N1  /4  +  1 )  +  N1  (N1 14  +  1 )  +  twiddle  mult »  NI2 

•  Can  do  2-D  DFT  by  not  performing  twiddle  multiplication 


Use  base-4  DFT  mapping  to  do  all  row/column  DFTs 


Base-4  Factorization  Architecture 


Two  Space-Time  Views  y 
(only  two  of  N1  iterations  shown) 


DFT  Input 


N  =  1024  points 

N  =  N1  *  N2 

N1  =  N2  =  32 

Uses  both  of  the 
two  optimal 
systolic  designs 

Twiddle 

multiplications  not 
shown 

Throughput/latency 
optimal  except  for 
interstage  delay 


Two  DFT  Architectures  Combined 


Output  Data  (Z) 
(N  Words) 


Input  Data  (X) 
(N  words) 


•  Shown  for  N  =  1024  points 

•  N-N1*N2 

•  N1  =  N2  =  32 

•  M  =  512  bits  (16  bit  word) 


□  Processing  Element  1:  2  registers,  1  adder 
0  Memory 

□  Multiplier 

□  Processing  Element  2:  2  registers,  1  adder 
I**  Local  data  flow  bus 


1st  to  2nd  Stage  Data  Formatting  Problem 

(32  Point  DFT) 


DFT  data  positions  of  1st  stage  output  sequences 
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Desired  data  positions  for  input  sequences  to  2nd  stage 
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Interstage  Data  Formatting  via  “On-the-Fly” 

Permutations 


•  New  code  with  matrix  rotation  steps 

for  n  to  N 

for  j  to  N/4  do 
for  k  to  N/4  do 

Y [  j  ,1c]  :=  WM[  j  ,k]  *add(CMl  [  j  ,i]  *X[i,k]  ,i=l .  .b)  od; 

for  k  to  b  do 

Z[k,j]  :=  add(CM2 [k,i] *Y[ j ,i] ,i=l . .N/4)  od; 

WM  :=  matrix_rotate (WM, "up" ) ; 

CMl  : =  matrix_rotate (CMl , "down" ) ; 

if  n  mod(b)=0  then  CM2  :=  matrix_rotate (CM2 , "down" )  fi; 

od; 

od; 


•  New  DFT  first  stage  output  sequences 
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1-D  DFT  Performance  Estimates 


FFT 

Size 

Throughput 

(cycles/DFT) 

Throughput 

(psec/DFT) 

Multipliers 

Adders 

256 

210 

1.0 

4 

32 

512 

274 

1.3 

8 

64 

1024 

671 

3.1 

8 

64 

2048 

914 

4.3 

16 

128 

4096 

2322 

10.8 

16 

128 

8192 

3346 

15.6 

32 

256 

Based  on: 

•  Register  transfer  level  behavioral  simulation  of  1024  point  DFT 

•  Partially  populated  layout 

•  Timing  analysis  using  Altera  Stratix  EP1S60  FPGA  chip 

•  16  bit  fixed-point  word  length 


Latency 


•  Base-4  FFT  pipeline  depth  is  nominally  A/y/4+  9  «  N 


•  Latency  (cycles)  =  1/Throughput  (cycles-1)  when  complete  X 
available 


Partitioning  to  Scale  Computations  to  Application 


Use  an  array  “section”  to  perform 
partially  processed  result 

Partial  results  accumulated  at  output 

Memory  needed  scales  with  partition  size 


Fully  Parallel  Array 


Partitioned  Array 


Non-Square  2-D  Inputs  ( N1  *  N) 


Example:  512-point  FFT  (N1  =  32,  N2 

On-the-fly  permutations  for  correct 
data  placement 


=  16) 


A//4 


Output  Data  (Z) 
(16  16-point  DFTs) 


Input  Data  (X) 
16  32-point  DFTs 


N2/4 


Output  Data  (Z) 
(16  16-point  DFTs) 
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Coefficients 

Coefficients 

(CM2) 

(CM2) 

Columns:  Compute  16  32-point  DFTs  Rows:  Compute  2  sets  of  16  16-point  DFTs 


Example  Resource  Usage+:  1024  Point  DFT 
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Cells 

Flip 

flops 
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Blocks 

Global 

Clocks 

Usage 

14717 

9200 

64 

32 
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Percent 
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26 

15 

11 

11 

44 

17 

t  Altera  Stratix  EP1S60F1508C6  FPGA  chip  (16  bit  fixed  point) 


Base-4  DFT  Architecture  Summary 


•  High  performance  1-D  and  2-D  DFTs 

•  Based  on  latency  and  throughput  optimal  parallel  circuits 

•  T ransform  size  not  restricted  to  N  =  r™ 

•  Latency  «1  /throughput  when  entire  input  block  available 

•  Architecture  is  scaleable  and  easily  parameterized 

•  Design  is  simple,  regular,  local  and  synchronous 

•  Fast  convolutions  naturally  supported 

•  Natural  partitioning  strategies  exist 

•  Pseudo-linear  architecture  good  fit  to  latest  generation  of 
FPGA  chips 
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•  “Automatic  Generation  of  Systolic  Array  Designs  For 
Reconfigurable  Computing”  ,  Proc.  Engineering  of 
Reconfigurable  Systems  and  Algorithms  (ERSA  '02), 
International  Multiconference  in  Computer  Science,  Las 
Vegas,  Nevada,  June  24,  2002. 

-  General  description  of  SPADE 

-  Faddeev  algorithm  (Find  CX+D,  given  AX=B,  X  is 
unknown) 

•  Constraint  Directed  CAD  Tool  For  Automatic  Latency-Optimal 
Implementations,  SPIE  ITCom  2002,  Boston,  Massachussetts, 
July  29-August  2,  2002. 

-  Use  of  constraints  as  a  filter  of  systolic  designs 

-  2-D  Discreet  Fourier  transforms  using  base-4  architecture 


